GOAL: Create a classification model that can predict whether or not a person has heart disease based on physical features of that person (age, sex, cholesterol, etc.).
Complete the TASKs written in bold below.
TASK: Run the cell below to import the necessary libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
This database contains 14 physical attributes based on physical testing of a patient. Blood samples are taken and the patient also conducts a brief exercise test. The "goal" field (the target column) refers to the presence of heart disease in the patient. It is an integer (0 for no presence, 1 for presence). Confirming with certainty whether a patient has heart disease can be quite an invasive process, so a model that accurately predicts the likelihood of heart disease can help avoid expensive and invasive procedures.
Content
Attribute Information:
Original Source: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Creators:
Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
TASK: Run the cell below to read in the data.
df = pd.read_csv('../DATA/heart.csv')
df.head()
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
df['target'].unique()
array([1, 0], dtype=int64)
Feel free to explore the data further on your own.
TASK: Explore if the dataset has any missing data points and create a statistical summary of the numerical features as shown below.
# CODE HERE
# isnull() marks missing entries; summing per column counts them
df.isnull().sum()
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
# CODE HERE
df.describe().transpose()
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
age | 303.0 | 54.366337 | 9.082101 | 29.0 | 47.5 | 55.0 | 61.0 | 77.0 |
sex | 303.0 | 0.683168 | 0.466011 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
cp | 303.0 | 0.966997 | 1.032052 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
trestbps | 303.0 | 131.623762 | 17.538143 | 94.0 | 120.0 | 130.0 | 140.0 | 200.0 |
chol | 303.0 | 246.264026 | 51.830751 | 126.0 | 211.0 | 240.0 | 274.5 | 564.0 |
fbs | 303.0 | 0.148515 | 0.356198 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
restecg | 303.0 | 0.528053 | 0.525860 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
thalach | 303.0 | 149.646865 | 22.905161 | 71.0 | 133.5 | 153.0 | 166.0 | 202.0 |
exang | 303.0 | 0.326733 | 0.469794 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
oldpeak | 303.0 | 1.039604 | 1.161075 | 0.0 | 0.0 | 0.8 | 1.6 | 6.2 |
slope | 303.0 | 1.399340 | 0.616226 | 0.0 | 1.0 | 1.0 | 2.0 | 2.0 |
ca | 303.0 | 0.729373 | 1.022606 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
thal | 303.0 | 2.313531 | 0.612277 | 0.0 | 2.0 | 2.0 | 3.0 | 3.0 |
target | 303.0 | 0.544554 | 0.498835 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
TASK: Create a bar plot that shows the total counts per target value.
# CODE HERE!
sns.countplot(data=df,x='target')
<AxesSubplot:xlabel='target', ylabel='count'>
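The class balance shown in the count plot can also be read off numerically. A minimal sketch, assuming a stand-in Series instead of the real df['target']: the counts of 165 and 138 are inferred from the mean of 0.544554 over 303 rows in the statistical summary, and the name `target` is hypothetical.

```python
import pandas as pd

# Stand-in for df['target']: 165 positives and 138 negatives (assumed counts,
# inferred from mean 0.544554 * 303 rows in the statistical summary).
target = pd.Series([1] * 165 + [0] * 138, name="target")

counts = target.value_counts()                # absolute count per class
shares = target.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(3))
```

With roughly 54% positives the classes are fairly balanced, so accuracy is a reasonable first metric here.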
TASK: Create a pairplot that displays the relationships between the following columns:
['age','trestbps', 'chol','thalach','target']
Note: Running a pairplot on everything can take a very long time due to the number of features
# CODE HERE
# Named pair_df rather than cp to avoid confusion with the cp column
pair_df=df[['age','trestbps', 'chol','thalach','target']]
sns.pairplot(data=pair_df,hue='target')
<seaborn.axisgrid.PairGrid at 0x1870783ac08>
TASK: Create a heatmap that displays the correlation between all the columns.
# CODE HERE
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(),annot=True,cmap='viridis')
<AxesSubplot:>
TASK: Separate the features from the labels into 2 objects, X and y.
# CODE HERE
X=df.drop('target',axis=1)
y=df['target']
TASK: Perform a train test split on the data, with the test size of 10% and a random_state of 101.
# CODE HERE
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.1,random_state=101)
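To see what a 10% test split means in row counts, here is a small sketch on placeholder data with the same shape as the heart dataset (303 rows, 13 features). The arrays `X_demo` and `y_demo` are random stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the heart dataset (303 rows, 13 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=101
)

# scikit-learn rounds the test share up, so 10% of 303 rows gives 31 test rows.
print(X_tr.shape, X_te.shape)
```

The 31 test rows match the support total in the classification report further down.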
TASK: Create a StandardScaler object and normalize the X train and test set feature data. Make sure you only fit to the training data to avoid data leakage (data knowledge leaking from the test set).
# CODE HERE
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
x_train=scaler.fit_transform(x_train)
x_test=scaler.transform(x_test)
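The reason for fitting only on the training data can be checked directly. A minimal sketch on random placeholder arrays (`train`, `test`, and `scaler_demo` are hypothetical names): the scaler learns its mean and standard deviation from the training rows and reuses them for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # placeholder training features
test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))    # placeholder test features

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler_demo.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0).round(6))  # approximately 0 by construction
print(test_scaled.mean(axis=0).round(3))   # near 0, but generally not exactly 0
```

The test set means are not exactly zero because no test statistics were used, which is the intended behavior.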
TASK: Create a Logistic Regression model and use Cross-Validation to find a well-performing C value for the hyper-parameter search. You have two options here, use LogisticRegressionCV OR use a combination of LogisticRegression and GridSearchCV. The choice is up to you.
# CODE HERE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
model=LogisticRegression()
# The default lbfgs solver supports only the l2 penalty; searching over l1
# as well would need a solver such as 'saga' or 'liblinear'.
penalty = ['l2']
C = np.logspace(0, 4, 10)
grid_model=GridSearchCV(model,param_grid={'C':C,'penalty':penalty})
grid_model.fit(x_train,y_train)
GridSearchCV(estimator=LogisticRegression(), param_grid={'C': array([1.00000000e+00, 2.78255940e+00, 7.74263683e+00, 2.15443469e+01, 5.99484250e+01, 1.66810054e+02, 4.64158883e+02, 1.29154967e+03, 3.59381366e+03, 1.00000000e+04]), 'penalty': ['l2']})
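The other option named in the task, LogisticRegressionCV, folds the search over C into the estimator itself. A minimal sketch on synthetic data from make_classification (not the heart dataset); `X_demo`, `y_demo`, and `cv_model` are hypothetical names, and the values of Cs, cv, and max_iter are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training data (not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=101)
X_demo = StandardScaler().fit_transform(X_demo)

# Cs=10 requests 10 candidate C values on a log scale; cv=5 uses 5-fold cross-validation.
cv_model = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
cv_model.fit(X_demo, y_demo)

print(cv_model.C_)      # C value selected by cross-validation
print(cv_model.coef_)   # coefficients of the refit model
```

Compared with GridSearchCV, this searches only over C (and l1_ratio for elastic net), which is usually enough for logistic regression.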
TASK: Report back your search's optimal parameters, specifically the C value.
Note: You may get a different value than what is shown here depending on how you conducted your search.
# CODE HERE
grid_model.best_params_
{'C': 2.7825594022071245, 'penalty': 'l2'}
TASK: Report back the model's coefficients.
grid_model.best_estimator_.coef_
array([[-0.06862347, -0.76677567, 0.92401506, -0.27433714, -0.22673577, 0.04684481, 0.12315594, 0.44657231, -0.43416162, -0.53866102, 0.39453632, -0.88123288, -0.58989011]])
BONUS TASK: We didn't show this in the lecture notebooks, but you have the skills to do this! Create a visualization of the coefficients by using a barplot of their values. Even more bonus points if you can figure out how to sort the plot! If you get stuck on this, feel free to quickly view the solutions notebook for hints. There are many ways to do this; the solutions use a combination of pandas and seaborn.
label=df.drop('target',axis=1).columns
# Use a new name so the original df is not overwritten
coef_df=pd.DataFrame(columns=label,data=grid_model.best_estimator_.coef_)
coef_df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.068623 | -0.766776 | 0.924015 | -0.274337 | -0.226736 | 0.046845 | 0.123156 | 0.446572 | -0.434162 | -0.538661 | 0.394536 | -0.881233 | -0.58989 |
plt.figure(figsize=(12,10))
# Wide-form data: one bar per column
sns.barplot(data=coef_df)
<AxesSubplot:>
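For the sorting part of the bonus task, one possible approach is to put the coefficients in a pandas Series and sort it before plotting. A minimal sketch, with the coefficient values copied from the output reported above; the names `features`, `values`, and `coefs` are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

features = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
            'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
# Coefficient values copied from the coef_ output reported above.
values = [-0.06862347, -0.76677567, 0.92401506, -0.27433714, -0.22673577,
          0.04684481, 0.12315594, 0.44657231, -0.43416162, -0.53866102,
          0.39453632, -0.88123288, -0.58989011]

# Sort ascending so the most negative coefficient is drawn first.
coefs = pd.Series(values, index=features).sort_values()

plt.figure(figsize=(10, 6))
ax = sns.barplot(x=coefs.index, y=coefs.values)
ax.set_ylabel('coefficient')
```

Because the features were standardized, the bar heights are roughly comparable: ca and sex push most strongly toward class 0, and cp most strongly toward class 1.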
TASK: Let's now evaluate your model on the remaining 10% of the data, the test set.
TASK: Create the following evaluations: a confusion matrix array, a confusion matrix plot, and a classification report.
# CODE HERE
from sklearn.metrics import ConfusionMatrixDisplay,confusion_matrix,classification_report
y_pred=grid_model.predict(x_test)
confusion_matrix(y_test,y_pred)
array([[12,  3],
       [ 2, 14]], dtype=int64)
# CODE HERE
# Rows are true classes, columns are predicted classes:
# 12 true negatives, 3 false positives, 2 false negatives, 14 true positives
ConfusionMatrixDisplay.from_estimator(grid_model,x_test,y_test)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1870c2d4a48>
# CODE HERE
print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

           0       0.86      0.80      0.83        15
           1       0.82      0.88      0.85        16

    accuracy                           0.84        31
   macro avg       0.84      0.84      0.84        31
weighted avg       0.84      0.84      0.84        31
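The figures in the report for class 1 follow directly from the confusion matrix. A short worked sketch using the matrix values reported above; the variable names are hypothetical.

```python
import numpy as np

# Confusion matrix from above: rows are true classes, columns are predicted classes.
cm = np.array([[12, 3],
               [2, 14]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)     # 14 / 17
recall_1 = tp / (tp + fn)        # 14 / 16
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()  # 26 / 31

print(f"precision={precision_1:.4f} recall={recall_1:.4f} "
      f"f1={f1_1:.4f} accuracy={accuracy:.4f}")
```

Rounded to two decimals these agree with the class 1 row and the accuracy row of the report.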
TASK: Create both the precision recall curve and the ROC Curve.
# CODE HERE
from sklearn.metrics import RocCurveDisplay,PrecisionRecallDisplay
PrecisionRecallDisplay.from_estimator(grid_model,x_test,y_test)
<sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x1870c2d7308>
# CODE HERE
RocCurveDisplay.from_estimator(grid_model,x_test,y_test)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x1870c37b848>
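The numbers behind both curves can also be computed without plotting. A minimal sketch on synthetic data from make_classification (not the heart dataset), assuming a plain LogisticRegression in place of the tuned model; `clf`, `scores`, and the `_demo` arrays are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=101
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]  # predicted probability of the positive class

fpr, tpr, _ = roc_curve(y_te, scores)                        # points on the ROC curve
precision, recall, _ = precision_recall_curve(y_te, scores)  # points on the PR curve
auc = roc_auc_score(y_te, scores)                            # area under the ROC curve

print(round(auc, 3))
```

Both curves are built from predicted probabilities rather than hard class labels, which is why predict_proba is used here.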
Final Task: A patient with the following features has come into the medical office:
age 54.0
sex 1.0
cp 0.0
trestbps 122.0
chol 286.0
fbs 0.0
restecg 0.0
thalach 116.0
exang 1.0
oldpeak 3.2
slope 1.0
ca 2.0
thal 2.0
TASK: What does your model predict for this patient? Do they have heart disease? How "sure" is your model of this prediction?
For convenience, we created an array of the features for the patient above.
patient = [[ 54. , 1. , 0. , 122. , 286. , 0. , 0. , 116. , 1. ,
3.2, 1. , 2. , 2. ]]
# EXPECTED PREDICTION
grid_model.predict(patient)
array([0], dtype=int64)
# EXPECTED PROBABILITY PER CLASS (Basically the model should be extremely sure it's in the 0 class)
grid_model.predict_proba(patient)
array([[1.00000000e+00, 7.74916259e-25]])
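Note that the model was fit on standardized features (x_train went through scaler.fit_transform), while the patient array holds raw values, which may explain the extreme probability above. A minimal sketch of passing a new row through the same scaler before predicting, on synthetic stand-in data; `scaler_demo`, `clf`, and `new_row` are hypothetical names. In the notebook the analogous call would presumably be grid_model.predict_proba(scaler.transform(patient)), whose output is not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, shifted and stretched so that scaling matters.
X_raw, y_demo = make_classification(n_samples=300, n_features=13, random_state=101)
X_raw = X_raw * 50.0 + 100.0

scaler_demo = StandardScaler()
clf = LogisticRegression(max_iter=1000).fit(scaler_demo.fit_transform(X_raw), y_demo)

new_row = X_raw[:1]  # one unscaled sample, shape (1, 13)

# Skipping the scaler feeds values far outside the range the model was trained on.
proba_raw = clf.predict_proba(new_row)
# Applying the training-time scaler gives the model inputs on the expected scale.
proba_scaled = clf.predict_proba(scaler_demo.transform(new_row))

print(proba_raw)
print(proba_scaled)
```

Wrapping the scaler and the classifier in a single sklearn Pipeline is another way to make sure new rows always receive the same preprocessing.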